14. Genome Assembly and Annotation Process
Abstract
The primary data produced by genome sequencing projects are often highly fragmented and sparsely annotated. This is especially true for the Human Genome Project [http://www.genome.gov/page.cfm?pageID=10001772] as a result of its policy of releasing sequence data to the public sequence databases every day (1, 2). So that individual researchers do not have to piece together extended segments of a genome and then relate the sequence to genetic maps and known genes, NCBI provides annotated assemblies of public genome sequence data. NCBI assimilates data of various types, from numerous sources, to provide an integrated view of a genome, making it easier for researchers to spot informative relationships that might not have been apparent from looking at the primary data. The annotated genomes can be explored using Map Viewer (Chapter 20) to display different types of data side-by-side and to follow links between related pieces of data.

This chapter describes the series of steps, the “pipeline”, that produces NCBI's annotated genome assembly from data deposited in the public sequence databases. A variant of the annotation process developed for the human genome is used to annotate the mouse genome, and similar procedures will be applied to other genomes (Box 1).

NCBI constantly strives to improve the accuracy of its human genome assembly and annotation, to make the data displays more informative, and to enhance the utility of our access tools. Each run through the assembly and annotation procedure, together with feedback from outside groups and individual users, is used to improve the process, refine the parameters for individual steps, and add new features. Consequently, the details of the assembly and annotation process change from one run to the next. This chapter, therefore, describes the overall human genome assembly and annotation process and provides short descriptions of the key steps, but it does not detail specific procedures or parameters. However, sufficient detail is provided to enable users of our assembly and annotations to become familiar with the complexities and possible limitations of the data we provide.
Similar resources
Draft Genome of Australian Environmental Strain WM 09.24 of the Opportunistic Human Pathogen Scedosporium aurantiacum
We report here the first genome assembly and annotation of the human-pathogenic fungus Scedosporium aurantiacum, with a predicted 10,525 genes and 11,661 transcripts. The strain WM 09.24 was isolated from the environment at Circular Quay, Sydney, New South Wales, Australia.
Proteogenomics produces comprehensive and highly accurate protein-coding gene annotation in a complete genome assembly of Malassezia sympodialis
Complete and accurate genome assembly and annotation is a crucial foundation for comparative and functional genomics. Despite this, few complete eukaryotic genomes are available, and genome annotation remains a major challenge. Here, we present a complete genome assembly of the skin commensal yeast Malassezia sympodialis and demonstrate how proteogenomics can substantially improve gene annotati...
Whole-Genome Sequence of Quorum-Sensing Vibrio tubiashii Strain T33
Vibrio tubiashii strain T33 was isolated from the coastal waters of Morib, Malaysia, and was shown to possess quorum-sensing activity similar to that of its famous relative Vibrio fischeri. Here, the assembly and annotation of its genome are presented.
Whole-Genome Sequences of Three Symbiotic Endozoicomonas Bacteria
Members of the genus Endozoicomonas associate with a wide range of marine organisms. Here, we report on the whole-genome sequencing, assembly, and annotation of three Endozoicomonas type strains. These data will assist in exploring interactions between Endozoicomonas organisms and their hosts, and they will aid in the assembly of genomes from uncultivated Endozoicomonas spp.
Chapter 14: Genome Assembly and Annotation Process
The basic procedures used to annotate other eukaryotic genomes are essentially the same as those used to annotate the human genome. However, the overall process is adjusted to accommodate the different types of input data that are available for each organism. Genes can be annotated on any genome for which a significant number of mRNA, EST, or protein sequences are available. Other features, suc...